09/05/2022

Exploring the database used to train AlphaFold2

Introduction

The Research Collaboratory for Structural Bioinformatics - Protein Data Bank (RCSB-PDB)

  • An open archive of experimental 3D structures
  • An estimated 1 million unique users annually

PDB Statistics - an example of the data distributions found at https://www.rcsb.org/stats/

Materials and Methods

From raw data to visualizations

Results: Bar plots I

Results: Bar plots II

Results: Bar plots III

Results: Further Analysis I

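library(dplyr)

# Count entries per superkingdom and molecule type:
# tally within each group, then keep one row per group
pdb_taxa_mol <- taxonomy_df %>%
  group_by(SUPERKINGDOM, `MOLECULE TYPE`) %>%
  add_tally(name = "n") %>%
  distinct(SUPERKINGDOM, `MOLECULE TYPE`, n)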
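
To make the tally concrete, here is a minimal sketch of the same grouping pattern on a made-up four-row table. The rows and the name toy_taxonomy are invented for illustration and are not PDB data; the sketch also assumes one row per entry, which the slide does not state.

```r
library(dplyr)

# Hypothetical toy data (NOT real PDB records), one row per entry
toy_taxonomy <- tibble(
  SUPERKINGDOM = c("Eukaryota", "Eukaryota", "Bacteria", "Eukaryota"),
  `MOLECULE TYPE` = c("Protein", "Protein", "Protein", "DNA")
)

# Tally rows within each (superkingdom, molecule type) group,
# then collapse to one row per group and drop the grouping
toy_counts <- toy_taxonomy %>%
  group_by(SUPERKINGDOM, `MOLECULE TYPE`) %>%
  add_tally(name = "n") %>%
  distinct(SUPERKINGDOM, `MOLECULE TYPE`, n) %>%
  ungroup()

print(toy_counts)
```

Each distinct combination ends up as a single row carrying its count in n, which is the shape a bar plot of counts per group would need.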

Results: Further Analysis II

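# Keep ID, resolution, and experiment type for the experiment types of interest
# (exp_type_levels is assumed to be a character vector defined earlier)
pdb_entries_aug %>%
  select(IDCODE, RESOLUTION, `EXPERIMENT TYPE`) %>%
  filter(`EXPERIMENT TYPE` %in% exp_type_levels)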

Results: Further Analysis III

  • The PDB grew exponentially at first
  • Growth appears to have plateaued in the early 2000s

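One way to check a claim of exponential growth is to fit a straight line to the logarithm of the yearly totals. The sketch below uses synthetic numbers that double every year, not real PDB counts, and the variable names are hypothetical; it only illustrates the idea and is not the analysis behind the slide.

```r
# Synthetic yearly totals (NOT real PDB counts): doubling every year
years <- 1980:1989
cumulative_entries <- 10 * 2^(years - 1980)

# Under exponential growth, log(count) is linear in time
fit <- lm(log(cumulative_entries) ~ years)
growth_rate <- unname(coef(fit)["years"])

# Doubling time in years implied by the fitted slope
doubling_time <- log(2) / growth_rate
print(doubling_time)
```

On real data, a plateau would show up as later points falling below the line fitted to the early years.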
Discussion

  • Successfully improved the PDB metadata visualizations

  • Database updates compromise reproducibility

  • Greatest challenge: combining files from different sources

  • Further analysis needed to account for redundancy

Acknowledgements